In this studio, we're going to play around with interoperability - that is, explore different data structures and how to make them connect up.
I find this kind of data science really frustrating (even though I work in data visualization), because it is really unintuitive. It requires a weird kind of imagining, so if you feel like you're losing your footing, don't worry, you're not alone. It's vertiginous!
But it means that we get to roll together some of the ideas we learnt about in our Tables studio and our Algorithms studio, and prepare for our Images studio in a few weeks' time.
Let's start by going back and getting some data from our last python studio.
Sign in to reddit using Google Chrome in a separate tab.
Then go to this page: https://www.reddit.com/prefs/apps
You should already have an app. If you don't, click "create app".

In the form that opens, you should enter your name, description and URI. For the redirect URI you should choose http://localhost:8080

Now, let's import our packages and set up our API connection. You need to fill out your own ID details!
!pip install praw
!pip install pandas
import praw
import pandas as pd
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",          # your client id
                     client_secret="YOUR_CLIENT_SECRET",  # your client secret
                     user_agent="android:com.example.myredditapp:v1.2.3 (by u/YOUR_USERNAME)", # user agent name
                     username="YOUR_USERNAME",            # your reddit username
                     password="YOUR_PASSWORD")            # your reddit password
print(reddit)
<praw.reddit.Reddit object at 0x000001F17E836EB0>
Now, let's scrape our subreddit. In the sub list you can choose your subreddit, and then use the query list to run a search term.
At the end we'll convert the results into a pandas DataFrame called "post_data", which we will use for later gymnastics, and save it to CSV for good measure.
sub = ['nsfwasmr'] # your subreddit
query = ['vibrator'] # your search term(s)
for s in sub:
    subreddit = reddit.subreddit(s)
    for item in query:
        posts = {
            "title": [],        # title of the post
            "score": [],        # score of the post
            "id": [],           # unique id of the post
            "url": [],          # url of the post
            "comms_num": [],    # the number of comments on the post
            "created": [],      # timestamp of the post
            "upvote_ratio": [], # the ratio of upvotes to downvotes on the post
            "body": []          # the body of the post
        }
        for submission in subreddit.search(item, sort="top", limit=1000): # max 1k
            posts["title"].append(submission.title)
            posts["score"].append(submission.score)
            posts["id"].append(submission.id)
            posts["url"].append(submission.url)
            posts["comms_num"].append(submission.num_comments)
            posts["created"].append(submission.created_utc)
            posts["upvote_ratio"].append(submission.upvote_ratio)
            posts["body"].append(submission.selftext)
        post_data = pd.DataFrame(posts)
        post_data.to_csv(s + "_" + item + "subreddit.csv")
print(subreddit)
nsfwasmr
For more info on the parameters you can request for a submission, see: http://lira.no-ip.org:8080/doc/praw-doc/html/code_overview/models/submission.html
In this next section, we're going to get used to different computational types and how they work together.
Let's see what our post_data from Reddit looks like:
post_data.head()
|   | title | score | id | url | comms_num | created | upvote_ratio | body |
|---|---|---|---|---|---|---|---|---|
| 0 | ASMR N3tw0rk vibrating yoga mat (video works now) | 156 | duws2y | http://dupose.com/asmr-network-yoga-instructor... | 18 | 1.573498e+09 | 0.88 | |
| 1 | "Hysterical Literature: Session Twelve" (women... | 89 | 4193c2 | https://www.youtube.com/watch?v=-_8-_NoXml0 | 2 | 1.452962e+09 | 0.95 | |
| 2 | cum clinic vibrator treatment | 69 | 8hoh3z | https://m.spankbang.com/261mg/video/cum+clinic... | 6 | 1.525708e+09 | 0.92 | |
| 3 | [moaning] [accents] [vibrator] "cruel" orgasm ... | 60 | 6txoe4 | https://www.pornhub.com/view_video.php?viewkey... | 2 | 1.502836e+09 | 0.96 | |
| 4 | Girl's reading distracted by vibrator in her v... | 52 | fwl5qe | https://eroasmr.com/video/girls-reading-distra... | 9 | 1.586269e+09 | 0.87 |
Different data types have different properties which allow them to do things, or not do things. For instance, you can't plot a character on a graph.
In Python, these are the main data types (thanks to Shawn Ren for the diagram):

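If you want a quick feel for these types, you can ask Python directly - a minimal sketch, with made-up example values:
print(type(42))      # <class 'int'> - an integer
print(type(3.14))    # <class 'float'> - a floating point number
print(type("hello")) # <class 'str'> - a string of characters
print(type(True))    # <class 'bool'> - a True/False value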
So, let's check out the data types of our post_data data set:
print(post_data.dtypes)
We're seeing a lot of pandas objects (because this is a dataframe - we'll need to convert these before we can use them), but also some integers and floating points, which are numeric forms. This is awesome!
So, let's try plotting some data using matplotlib's pyplot. Most digital images are Cartesian (like maps!), meaning that they work on an x,y axis, where each pixel is assigned an x,y coordinate. This coordinate system, the basis of analytic geometry, combines spatial measurement forms with numeric forms.

So, you can set any of the int64 or float64 values against each other:
!pip install matplotlib
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
# scatter the number of comments against the upvote ratio
ax.scatter(post_data['upvote_ratio'], post_data['comms_num']) # format: (x column, y column)
# set a title and labels
ax.set_title('Number of Comments vs Upvote Ratio')
ax.set_xlabel('Ratio of Upvotes to Downvotes')
ax.set_ylabel('Number of Comments')
Text(0, 0.5, 'number of comments')
Okay, cool. But what about the time of the post? Take a look at the "created" column - this is a timestamp in Unix time, a universal time format that is free from timezones:
Unix time (a.k.a. POSIX time or Epoch time) is a system for describing instants in time, defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, not counting leap seconds. It is used widely in Unix-like and many other operating systems and file formats. Due to its handling of leap seconds, it is neither a linear representation of time nor a true representation of UTC.
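To preview what that conversion looks like on a single value, here's a quick sketch (the timestamp is roughly the "created" stamp from row 0 of our table above):
print(pd.to_datetime(1573497834, unit="s")) # 2019-11-11 18:43:54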
We're going to need to bring that into something readable to humans! So, let's convert it, and make sure that it's in a datetime format and that it looks about right to the human eye:
post_data["created"]= pd.to_datetime(post_data["created"], yearfirst=True, unit="s")
print(post_data.dtypes)
post_data.head()
title                   object
score                    int64
id                      object
url                     object
comms_num                int64
created         datetime64[ns]
upvote_ratio           float64
body                    object
dtype: object
|   | title | score | id | url | comms_num | created | upvote_ratio | body |
|---|---|---|---|---|---|---|---|---|
| 0 | ASMR N3tw0rk vibrating yoga mat (video works now) | 156 | duws2y | http://dupose.com/asmr-network-yoga-instructor... | 18 | 2019-11-11 18:43:54 | 0.88 | |
| 1 | "Hysterical Literature: Session Twelve" (women... | 89 | 4193c2 | https://www.youtube.com/watch?v=-_8-_NoXml0 | 2 | 2016-01-16 16:37:04 | 0.95 | |
| 2 | cum clinic vibrator treatment | 69 | 8hoh3z | https://m.spankbang.com/261mg/video/cum+clinic... | 6 | 2018-05-07 15:45:47 | 0.92 | |
| 3 | [moaning] [accents] [vibrator] "cruel" orgasm ... | 60 | 6txoe4 | https://www.pornhub.com/view_video.php?viewkey... | 2 | 2017-08-15 22:20:33 | 0.96 | |
| 4 | Girl's reading distracted by vibrator in her v... | 52 | fwl5qe | https://eroasmr.com/video/girls-reading-distra... | 9 | 2020-04-07 14:10:40 | 0.87 |
Now, let's plot the date against the number of comments.
fig, ax = plt.subplots()
# scatter the comments against the date
ax.scatter(post_data['created'], post_data['comms_num'])
# set labels
ax.set_xlabel('Date')
ax.set_ylabel('Number of Comments on Posts')
Text(0, 0.5, 'Number of Comments on Posts')
So, we've been using the useful and classic matplotlib to do our graphics. But it's not really the best. Let's try another library and see if we can get some more information. Let's use plotly.
!pip install plotly==4.14.3
import plotly.express as px
fig = px.scatter(post_data, x="created", y="upvote_ratio", size="comms_num", color="score", hover_name="title", size_max=60)
fig.show()
I'm not going to bore you with more graphs - but when you're feeling up to it, feel free to take a look at the different kinds of charts you can make and have a play around - you could even combine several reddit datasets!
Now, let's turn to something a little more complicated, with some reflections on Wernimont's piece on the Quantified Self and explore some of the ways in which our bodies are made data.
I've located and exported my own (seriously incomplete - I didn't even realise I had authorised its collection) health data from my iPhone's Health App, for a laugh.
When downloaded, this comes in a .zip format. When expanded, you get two files - export.xml is the one that we want.
XML, like the GeoJSON we met earlier, is a good format for holding together different types of data in the same document. But it's not super useful for Python on its own, so we're going to run the apple-health-data-parser created by Nicholas Radcliffe to "parse" (or separate out) the data into different CSV files. Then we can have a little look at it more closely.
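If you're curious what "parsing" actually means here, the sketch below uses Python's built-in ElementTree to peek at the raw XML - the Health export stores each measurement as a <Record> element with attributes, which the parser script pulls out at scale. (This snippet is purely illustrative; the studio itself uses the parser script below.)
import xml.etree.ElementTree as ET
from itertools import islice
tree = ET.parse("export.xml")
root = tree.getroot()
# peek at the attributes of the first three <Record> elements
for record in islice(root.iter("Record"), 3):
    print(record.attrib)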
Normally, we would run a .py file using the command line (like Terminal), but Jupyter is friendly, and actually lets us run .py files as if from the command line, inside the notebook! So, making sure that the following are in the same folder (which they will be if you have downloaded this from github) - Interoperability_Studio.ipynb, apple-health-data-parser.py and export.xml - let's try to do some parsing!
%run -i "apple-health-data-parser" "export.xml"
Reading data from export.xml . . . done
Unexpected node of type ExportDate.
Tags:
ExportDate: 1
Me: 1
Record: 50795
Fields:
HKCharacteristicTypeIdentifierBiologicalSex: 1
HKCharacteristicTypeIdentifierBloodType: 1
HKCharacteristicTypeIdentifierDateOfBirth: 1
HKCharacteristicTypeIdentifierFitzpatrickSkinType: 1
creationDate: 50795
device: 50795
endDate: 50795
sourceName: 50795
sourceVersion: 50795
startDate: 50795
type: 50795
unit: 50795
value: 50796
Record types:
DistanceWalkingRunning: 23826
FlightsClimbed: 3133
HeadphoneAudioExposure: 8
StepCount: 23828
Opening C:\Users\afelix\Documents\Writing\PHD\Year 1\Digital Geographies (SP21)\Studio\interoperability_studio-20210309T181550Z-001\interoperability_studio\StepCount.csv for writing
Opening C:\Users\afelix\Documents\Writing\PHD\Year 1\Digital Geographies (SP21)\Studio\interoperability_studio-20210309T181550Z-001\interoperability_studio\DistanceWalkingRunning.csv for writing
Opening C:\Users\afelix\Documents\Writing\PHD\Year 1\Digital Geographies (SP21)\Studio\interoperability_studio-20210309T181550Z-001\interoperability_studio\FlightsClimbed.csv for writing
Opening C:\Users\afelix\Documents\Writing\PHD\Year 1\Digital Geographies (SP21)\Studio\interoperability_studio-20210309T181550Z-001\interoperability_studio\HeadphoneAudioExposure.csv for writing
Written StepCount data.
Written DistanceWalkingRunning data.
Written FlightsClimbed data.
Written HeadphoneAudioExposure data.
Awesome! Looks like Apple has been secretly collecting four kinds of my data: flights of stairs climbed, how often and loudly I use my headphones, my step count and how far I walk. Let's explore some of this data.
We start by importing (installing first, if we haven't already) our libraries: numpy (numeric python, num-py), pandas (our much loved data format), glob, which helps us find data paths on our computers, pyplot for making graphs, the pytz time zone calculator, and datetime, which does as it says.
import numpy as np
import pandas as pd
import glob
import matplotlib.pyplot as plt
%matplotlib inline
from datetime import date, datetime, timedelta as td
import pytz
Okay, let's see what this data is all about.
steps = pd.read_csv("StepCount.csv") #use pandas (pd) to read the csv file
steps.head() #have a look at the top rows of data
|   | sourceName | sourceVersion | device | type | unit | creationDate | startDate | endDate | value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Clancy Wilmott’s iPhone | 10.3.3 | <<HKDevice: 0x283060eb0>, name:iPhone, manufac... | StepCount | count | 2017-09-20 00:58:31 -0800 | 2017-09-20 00:10:26 -0800 | 2017-09-20 00:16:38 -0800 | 11 |
| 1 | Clancy Wilmott’s iPhone | 10.3.3 | <<HKDevice: 0x283061680>, name:iPhone, manufac... | StepCount | count | 2017-09-20 00:58:31 -0800 | 2017-09-20 00:44:40 -0800 | 2017-09-20 00:50:52 -0800 | 8 |
| 2 | Clancy Wilmott’s iPhone | 10.3.3 | <<HKDevice: 0x283061770>, name:iPhone, manufac... | StepCount | count | 2017-09-20 02:03:34 -0800 | 2017-09-20 01:02:29 -0800 | 2017-09-20 01:11:25 -0800 | 100 |
| 3 | Clancy Wilmott’s iPhone | 10.3.3 | <<HKDevice: 0x283061860>, name:iPhone, manufac... | StepCount | count | 2017-09-20 02:03:34 -0800 | 2017-09-20 01:17:15 -0800 | 2017-09-20 01:24:21 -0800 | 97 |
| 4 | Clancy Wilmott’s iPhone | 10.3.3 | <<HKDevice: 0x283061950>, name:iPhone, manufac... | StepCount | count | 2017-09-20 03:05:00 -0800 | 2017-09-20 01:58:00 -0800 | 2017-09-20 02:07:54 -0800 | 100 |
And check out what kind of data we're working with here:
print(steps.dtypes)
sourceName       object
sourceVersion    object
device           object
type             object
unit             object
creationDate     object
startDate        object
endDate          object
value             int64
dtype: object
Lots of objects, again, and some messy time formats too. Let's clean up. We need to start with date-time - the data crosses a few timezones, I think, but I want to bring it into the one I'm in now - America/Los_Angeles.
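The conversion functions below are written as lambdas. If lambdas are new to you, a lambda is just a compact, one-line way of writing a function - these two sketches are equivalent:
double = lambda x: x * 2 # a lambda...
def double_def(x):       # ...and the ordinary function it stands for
    return x * 2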
# functions to convert UTC to LA time zone and extract date/time elements
convert_tz = lambda x: x.to_pydatetime().replace(tzinfo=pytz.utc).astimezone(pytz.timezone('America/Los_Angeles'))
get_year = lambda x: convert_tz(x).year
get_month = lambda x: '{}-{:02}'.format(convert_tz(x).year, convert_tz(x).month) #inefficient
get_date = lambda x: '{}-{:02}-{:02}'.format(convert_tz(x).year, convert_tz(x).month, convert_tz(x).day) #inefficient
get_day = lambda x: convert_tz(x).day
get_hour = lambda x: convert_tz(x).hour
get_minute = lambda x: convert_tz(x).minute
get_day_of_week = lambda x: convert_tz(x).weekday()
Now, let's "parse" (or separate) out the different time sections:
# parse out date and time elements as LA time
steps['startDate'] = pd.to_datetime(steps['startDate'])
steps['year'] = steps['startDate'].map(get_year)
steps['month'] = steps['startDate'].map(get_month)
steps['date'] = steps['startDate'].map(get_date)
steps['day'] = steps['startDate'].map(get_day)
steps['hour'] = steps['startDate'].map(get_hour)
steps['dow'] = steps['startDate'].map(get_day_of_week)
And check it's lookin' good!
steps.head()
|   | sourceName | sourceVersion | device | type | unit | creationDate | startDate | endDate | value | year | month | date | day | hour | dow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Clancy Wilmott’s iPhone | 10.3.3 | <<HKDevice: 0x283060eb0>, name:iPhone, manufac... | StepCount | count | 2017-09-20 00:58:31 -0800 | 2017-09-20 00:10:26-08:00 | 2017-09-20 00:16:38 -0800 | 11 | 2017 | 2017-09 | 2017-09-19 | 19 | 17 | 1 |
| 1 | Clancy Wilmott’s iPhone | 10.3.3 | <<HKDevice: 0x283061680>, name:iPhone, manufac... | StepCount | count | 2017-09-20 00:58:31 -0800 | 2017-09-20 00:44:40-08:00 | 2017-09-20 00:50:52 -0800 | 8 | 2017 | 2017-09 | 2017-09-19 | 19 | 17 | 1 |
| 2 | Clancy Wilmott’s iPhone | 10.3.3 | <<HKDevice: 0x283061770>, name:iPhone, manufac... | StepCount | count | 2017-09-20 02:03:34 -0800 | 2017-09-20 01:02:29-08:00 | 2017-09-20 01:11:25 -0800 | 100 | 2017 | 2017-09 | 2017-09-19 | 19 | 18 | 1 |
| 3 | Clancy Wilmott’s iPhone | 10.3.3 | <<HKDevice: 0x283061860>, name:iPhone, manufac... | StepCount | count | 2017-09-20 02:03:34 -0800 | 2017-09-20 01:17:15-08:00 | 2017-09-20 01:24:21 -0800 | 97 | 2017 | 2017-09 | 2017-09-19 | 19 | 18 | 1 |
| 4 | Clancy Wilmott’s iPhone | 10.3.3 | <<HKDevice: 0x283061950>, name:iPhone, manufac... | StepCount | count | 2017-09-20 03:05:00 -0800 | 2017-09-20 01:58:00-08:00 | 2017-09-20 02:07:54 -0800 | 100 | 2017 | 2017-09 | 2017-09-19 | 19 | 18 | 1 |
Coolios - as you can see above, EVERYTHING IS NUMBERS. SEPARATE CATEGORISED NUMBERS. What are those categories, you ask?
steps.columns
Index(['sourceName', 'sourceVersion', 'device', 'type', 'unit', 'creationDate',
'startDate', 'endDate', 'value', 'year', 'month', 'date', 'day', 'hour',
'dow'],
dtype='object')
We can create some groups for each date, to see how many steps were taken each day.
steps_by_date = steps.groupby(['date'])['value'].sum().reset_index(name='Steps')
steps_by_date.head()
|   | date | Steps |
|---|---|---|
| 0 | 2017-09-19 | 3453 |
| 1 | 2017-09-20 | 13298 |
| 2 | 2017-09-21 | 10819 |
| 3 | 2017-09-22 | 1221 |
| 4 | 2017-09-23 | 6682 |
Now, let's save it to CSV for good measure, and so we can start visualising!
steps_by_date.to_csv("steps_per_day.csv", index=False)
Time to turn numbers back into images.
steps_by_date['RollingMeanSteps'] = steps_by_date.Steps.rolling(window=10, center=True).mean()
steps_by_date.plot(x='date', y='RollingMeanSteps', title= 'Daily step counts rolling mean over 10 days', figsize=[10, 6])
<AxesSubplot:title={'center':'Daily step counts rolling mean over 10 days'}, xlabel='date'>
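If the rolling mean is new to you, here's a minimal sketch of what it does on a toy series - each point becomes the average of the window centred on it, and the ends are NaN because their windows are incomplete:
toy = pd.Series([0, 10, 20, 30, 40])
print(toy.rolling(window=3, center=True).mean()) # NaN, 10.0, 20.0, 30.0, NaN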
What about weekdays? Let's regroup our data and see what we find.
#regroup
steps_by_date['date'] = pd.to_datetime(steps_by_date['date'])
steps_by_date['dow'] = steps_by_date['date'].dt.weekday
#plot
data = steps_by_date.groupby(['dow'])['Steps'].mean()
fig, ax = plt.subplots(figsize=[10, 6])
data.plot(kind='bar', ax=ax)
n_groups = len(data)
index = np.arange(n_groups)
ax.yaxis.grid(True)
plt.suptitle('Average Steps by Day of the Week', fontsize=16)
dow_labels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
plt.xticks(index, dow_labels, rotation=45)
plt.xlabel('Day of Week', fontsize=12, color='red')
Text(0.5, 0, 'Day of Week')
What about hours (bearing in mind time zones)?
hour_steps = steps.groupby(['hour'])['value'].sum().reset_index(name='Steps')
ax = hour_steps.Steps.plot(kind='line', figsize=[10, 5], linewidth=4, alpha=1, marker='o', color='#6684c1',
markeredgecolor='#6684c1', markerfacecolor='w', markersize=8, markeredgewidth=2)
xlabels = hour_steps.index.map(lambda x: '{:02}:00'.format(x))
ax.set_xticks(range(len(xlabels)))
ax.set_xticklabels(xlabels, rotation=45, rotation_mode='anchor', ha='right')
# ax.set_xlim((hour_steps.index[0], hour_steps.index[-1]))
ax.yaxis.grid(True)
# ax.set_ylim((0, 1300))
ax.set_ylabel('Steps')
ax.set_xlabel('')
ax.set_title('Steps by hour of the day')
plt.show()
Let's keep building up the numeric representation of my lived mobilities. What about flights?
flights = pd.read_csv("FlightsClimbed.csv") #use pandas (pd) to read the csv file
flights.head() #have a look at the top rows of data
|   | sourceName | sourceVersion | device | type | unit | creationDate | startDate | endDate | value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Clancy Wilmott’s iPhone | 10.3.3 | <<HKDevice: 0x2830d0690>, name:iPhone, manufac... | FlightsClimbed | count | 2017-09-20 04:02:51 -0800 | 2017-09-20 03:38:30 -0800 | 2017-09-20 03:38:30 -0800 | 1 |
| 1 | Clancy Wilmott’s iPhone | 10.3.3 | <<HKDevice: 0x2830d1130>, name:iPhone, manufac... | FlightsClimbed | count | 2017-09-20 06:59:09 -0800 | 2017-09-20 06:10:39 -0800 | 2017-09-20 06:10:39 -0800 | 1 |
| 2 | Clancy Wilmott’s iPhone | 10.3.3 | <<HKDevice: 0x2830d0e60>, name:iPhone, manufac... | FlightsClimbed | count | 2017-09-20 10:58:44 -0800 | 2017-09-20 10:15:03 -0800 | 2017-09-20 10:15:03 -0800 | 1 |
| 3 | Clancy Wilmott’s iPhone | 10.3.3 | <<HKDevice: 0x2830d3570>, name:iPhone, manufac... | FlightsClimbed | count | 2017-09-20 12:13:10 -0800 | 2017-09-20 11:02:53 -0800 | 2017-09-20 11:02:53 -0800 | 1 |
| 4 | Clancy Wilmott’s iPhone | 10.3.3 | <<HKDevice: 0x2830d3f20>, name:iPhone, manufac... | FlightsClimbed | count | 2017-09-21 00:13:37 -0800 | 2017-09-21 00:00:28 -0800 | 2017-09-21 00:00:28 -0800 | 1 |
Let's parse it out again:
# parse out date and time elements as LA time
flights['startDate'] = pd.to_datetime(flights['startDate'])
flights['year'] = flights['startDate'].map(get_year)
flights['month'] = flights['startDate'].map(get_month)
flights['date'] = flights['startDate'].map(get_date)
flights['day'] = flights['startDate'].map(get_day)
flights['hour'] = flights['startDate'].map(get_hour)
flights['dow'] = flights['startDate'].map(get_day_of_week)
And group it into dates:
flights_by_date = flights.groupby(['date'])['value'].sum().reset_index(name='Flights')
flights_by_date.head()
|   | date | Flights |
|---|---|---|
| 0 | 2017-09-19 | 2 |
| 1 | 2017-09-20 | 10 |
| 2 | 2017-09-21 | 5 |
| 3 | 2017-09-22 | 2 |
| 4 | 2017-09-23 | 2 |
And save...
flights_by_date.to_csv("flights_by_date.csv", index=False)
flights_by_date['date'] = pd.to_datetime(flights_by_date['date'])
flights_by_date['dow'] = flights_by_date['date'].dt.weekday
#plot
data = flights_by_date.groupby(['dow'])['Flights'].mean()
fig, ax = plt.subplots(figsize=[10, 6])
ax = data.plot(kind='bar', x='day_of_week')
n_groups = len(data)
index = np.arange(n_groups)
opacity = 0.75
#fig, ax = plt.subplots(figsize=[10, 6])
ax.yaxis.grid(True)
plt.suptitle('Average Flights by Day of the Week', fontsize=16)
dow_labels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
plt.xticks(index, dow_labels, rotation=45)
plt.xlabel('Day of Week', fontsize=12, color='red')
Text(0.5, 0, 'Day of Week')
To be totally ridiculous, let's compare how many steps I take per day of the week with how many comments are made per day of the week on our chosen subreddit.
First, we need to parse out (again) the time/date data. Then, it's just like above, using "groupby", while paying attention to the column headers.
post_data.head()
|   | title | score | id | url | comms_num | created | upvote_ratio | body |
|---|---|---|---|---|---|---|---|---|
| 0 | ASMR N3tw0rk vibrating yoga mat (video works now) | 156 | duws2y | http://dupose.com/asmr-network-yoga-instructor... | 18 | 2019-11-11 18:43:54 | 0.88 | |
| 1 | "Hysterical Literature: Session Twelve" (women... | 89 | 4193c2 | https://www.youtube.com/watch?v=-_8-_NoXml0 | 2 | 2016-01-16 16:37:04 | 0.95 | |
| 2 | cum clinic vibrator treatment | 69 | 8hoh3z | https://m.spankbang.com/261mg/video/cum+clinic... | 6 | 2018-05-07 15:45:47 | 0.92 | |
| 3 | [moaning] [accents] [vibrator] "cruel" orgasm ... | 60 | 6txoe4 | https://www.pornhub.com/view_video.php?viewkey... | 2 | 2017-08-15 22:20:33 | 0.96 | |
| 4 | Girl's reading distracted by vibrator in her v... | 52 | fwl5qe | https://eroasmr.com/video/girls-reading-distra... | 9 | 2020-04-07 14:10:40 | 0.87 |
# parse out date and time elements as LA time
post_data['created'] = pd.to_datetime(post_data['created'])
post_data['year'] = post_data['created'].map(get_year)
post_data['month'] = post_data['created'].map(get_month)
post_data['date'] = post_data['created'].map(get_date)
post_data['day'] = post_data['created'].map(get_day)
post_data['hour'] = post_data['created'].map(get_hour)
post_data['dow'] = post_data['created'].map(get_day_of_week)
post_data.head()
|   | title | score | id | url | comms_num | created | upvote_ratio | body | year | month | date | day | hour | dow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ASMR N3tw0rk vibrating yoga mat (video works now) | 156 | duws2y | http://dupose.com/asmr-network-yoga-instructor... | 18 | 2019-11-11 18:43:54 | 0.88 |  | 2019 | 2019-11 | 2019-11-11 | 11 | 10 | 0 |
| 1 | "Hysterical Literature: Session Twelve" (women... | 89 | 4193c2 | https://www.youtube.com/watch?v=-_8-_NoXml0 | 2 | 2016-01-16 16:37:04 | 0.95 |  | 2016 | 2016-01 | 2016-01-16 | 16 | 8 | 5 |
| 2 | cum clinic vibrator treatment | 69 | 8hoh3z | https://m.spankbang.com/261mg/video/cum+clinic... | 6 | 2018-05-07 15:45:47 | 0.92 |  | 2018 | 2018-05 | 2018-05-07 | 7 | 8 | 0 |
| 3 | [moaning] [accents] [vibrator] "cruel" orgasm ... | 60 | 6txoe4 | https://www.pornhub.com/view_video.php?viewkey... | 2 | 2017-08-15 22:20:33 | 0.96 |  | 2017 | 2017-08 | 2017-08-15 | 15 | 15 | 1 |
| 4 | Girl's reading distracted by vibrator in her v... | 52 | fwl5qe | https://eroasmr.com/video/girls-reading-distra... | 9 | 2020-04-07 14:10:40 | 0.87 |  | 2020 | 2020-04 | 2020-04-07 | 7 | 7 | 1 |
c_df = post_data.groupby(['dow'])['comms_num'].sum()
s_df = steps_by_date.groupby(['dow'])['Steps'].median()
fig, ax = plt.subplots(figsize=[10, 6])
s_df.plot(kind='line', ax=ax, label='Median Steps')
c_df.plot(kind='line', ax=ax, label='Total Comments')
ax.yaxis.grid(True)
ax.legend()
plt.suptitle('Steps VS Reddit Comments', fontsize=16)
dow_labels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
plt.xticks(np.arange(len(dow_labels)), dow_labels, rotation=45)
plt.xlabel('Day of Week', fontsize=12, color='green')
Text(0.5, 0, 'Day of Week')
Have a mess around in your own time - compare means to medians, and ask your friends in data science what it's all about, because honestly, it's just a strange kind of magic.
Turning back to our subreddit, and channelling cultural analytics, let's look a little more closely at some text analysis and see what we can do!
Text works as a str or string:
A word is a string of individual letters, a sentence is a string of words!
(Strings are used a lot in the Digital Humanities and Text Processing - I'm a geographer, and still learning about strings, so bear with me!)
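For instance, a minimal sketch (with made-up values):
word = "interoperability"            # a string of individual letters
sentence = "strings are everywhere"  # a string of words
print(word[0])          # 'i' - characters are indexed by position
print(sentence.split()) # ['strings', 'are', 'everywhere']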
Let's start by grabbing a cell with an object from our post_data dataset. With a pandas data frame, everything works on a gridded position as well! You can use iloc (or location by integer position) to find particular cells. Let's start with row number 21:
post_data.iloc[21]
title                                            looking for video
score                                                           12
id                                                          3zqbj9
url              https://www.reddit.com/r/nsfwasmr/comments/3zq...
comms_num                                                        3
created                                        2016-01-06 15:48:25
upvote_ratio                                                     1
body             I'm looking for a specific video that was post...
year                                                          2016
month                                                      2016-01
date                                                    2016-01-06
day                                                              6
hour                                                             7
dow                                                              2
Name: 21, dtype: object
Now, if you count down the list (starting from zero), body is number 7, so let's add that to get the individual cell.
post_data.iloc[21,7]
"I'm looking for a specific video that was posted on this subreddit a while ago. it was more artistic and sensual than raunchy. the video was of a woman using a small vibrater all the way until cumming. the video had a clock sound going in the background. it was really good. I know that isn't a lot to go on, but does anyone know where to find this video? and if possible who makes it? (so I can find more similar content)"
Now, let's convert it from a pandas object to a string, and give it a name, so we can do some analysis:
cell = str(post_data.iloc[21,7])
print(cell)
I'm looking for a specific video that was posted on this subreddit a while ago. it was more artistic and sensual than raunchy. the video was of a woman using a small vibrater all the way until cumming. the video had a clock sound going in the background. it was really good. I know that isn't a lot to go on, but does anyone know where to find this video? and if possible who makes it? (so I can find more similar content)
We can count how many characters are in the string:
len(cell) #len = length
422
Or find what the letter at a given position of the string is (index 45 in the example below - the 46th character, since we count from zero):
cell[45]
't'
If we wanted to be braver, we could even try to count the most common words across all the posts in the "title" column:
from collections import Counter
Counter(" ".join(post_data["title"]).split()).most_common(20)
[('|', 17),
('vibrator', 9),
('a', 8),
('ASMR', 7),
('in', 7),
('-', 6),
('Vibrator', 5),
('my', 5),
('while', 4),
('Whispers', 4),
('the', 4),
('reading', 3),
('on', 3),
('her', 3),
('to', 3),
('vibrating', 2),
('cum', 2),
('[vibrator]', 2),
('Ginny', 2),
('Redhead', 2)]
So, there are many "to", "the", "of" .... These are called "stopwords". Let's create a new column with all the stopwords deleted so we can count again.
To do this we import the nltk stopwords corpus, which has a list of common English words.
!pip install nltk
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop = stopwords.words('english')
[nltk_data] Downloading package stopwords to [nltk_data] C:\Users\afelix\AppData\Roaming\nltk_data... [nltk_data] Unzipping corpora\stopwords.zip.
Then we delete the stopwords from the title column and make a new column without them. (Note that the match is case-sensitive, which is why capitalised words like "I" survive below.)
post_data['title_without_stopwords'] = post_data['title'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
post_data.head()
|   | title | score | id | url | comms_num | created | upvote_ratio | body | year | month | date | day | hour | dow | title_without_stopwords |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ASMR N3tw0rk vibrating yoga mat (video works now) | 156 | duws2y | http://dupose.com/asmr-network-yoga-instructor... | 18 | 2019-11-11 18:43:54 | 0.88 |  | 2019 | 2019-11 | 2019-11-11 | 11 | 10 | 0 | ASMR N3tw0rk vibrating yoga mat (video works now) |
| 1 | "Hysterical Literature: Session Twelve" (women... | 89 | 4193c2 | https://www.youtube.com/watch?v=-_8-_NoXml0 | 2 | 2016-01-16 16:37:04 | 0.95 |  | 2016 | 2016-01 | 2016-01-16 | 16 | 8 | 5 | "Hysterical Literature: Session Twelve" (women... |
| 2 | cum clinic vibrator treatment | 69 | 8hoh3z | https://m.spankbang.com/261mg/video/cum+clinic... | 6 | 2018-05-07 15:45:47 | 0.92 |  | 2018 | 2018-05 | 2018-05-07 | 7 | 8 | 0 | cum clinic vibrator treatment |
| 3 | [moaning] [accents] [vibrator] "cruel" orgasm ... | 60 | 6txoe4 | https://www.pornhub.com/view_video.php?viewkey... | 2 | 2017-08-15 22:20:33 | 0.96 |  | 2017 | 2017-08 | 2017-08-15 | 15 | 15 | 1 | [moaning] [accents] [vibrator] "cruel" orgasm ... |
| 4 | Girl's reading distracted by vibrator in her v... | 52 | fwl5qe | https://eroasmr.com/video/girls-reading-distra... | 9 | 2020-04-07 14:10:40 | 0.87 |  | 2020 | 2020-04 | 2020-04-07 | 7 | 7 | 1 | Girl's reading distracted vibrator vagina |
And try again...
Counter(" ".join(post_data["title_without_stopwords"]).split()).most_common(20)
[('|', 17),
('vibrator', 9),
('ASMR', 7),
('-', 6),
('Vibrator', 5),
('Whispers', 4),
('reading', 3),
('vibrating', 2),
('cum', 2),
('[vibrator]', 2),
('Ginny', 2),
('Redhead', 2),
('Masturbation', 2),
('Moaning', 2),
('Naked', 2),
('I', 2),
('mouth', 2),
('mouth]', 2),
('[cum', 2),
('sounds', 2)]
Well done!
(as a bonus, you could turn this into a data frame if you wanted, and plot it as well! - though it's not a super interesting graph!)
words_num = Counter(" ".join(post_data["title_without_stopwords"]).split()).most_common(20)
words_num_df = pd.DataFrame(words_num, columns=['word','count'])
fig = px.scatter(words_num_df, x="word", y="count", hover_name="word")
fig.show()
Okay, let's try some data that we don't necessarily think of as numeric: sound.
Let's import some libraries to help us out with sound.
! pip install pydub
! pip install scipy
Collecting pydub
Downloading pydub-0.25.0-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.0
Now, let's import those libraries and read our file. We're directly referencing the sound_sample.wav file that is in your downloaded folder. And let's print the rate and the audio data.
#required libraries
import scipy.io.wavfile
import pydub
rate,audData=scipy.io.wavfile.read("sound_sample.wav")
print(rate)
print(audData)
22050 [ 0 0 0 ... 410 -445 -2167]
C:\Users\afelix\anaconda3\lib\site-packages\pydub\utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
The outputs from wavfile.read are the sampling rate of the track and the audio wave data. The sampling rate represents the number of data points sampled per second in the audio file. In this case, 22050 pieces of information per second make up the audio wave (44100, the rate CDs use, is also very common). The higher the rate, the better quality the audio.
Dividing the number of samples in the audio data by the rate gives us the length of the track in seconds:
#wav length
audData.shape[0] / rate
60.0
Looking at the shape of the audio data, it has a single array (one channel), so it's mono audio; stereo data would have a second column. Let's also check how each sample is stored:
audData.dtype
dtype('int16')
The data is stored as int16 - this is the bit depth, the size of each stored data point. Common bit depths are 8, 16 and 32 bits. Again, the higher this is, the better the audio quality.
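A quick sketch of the range an int16 sample can take (numpy is already imported as np above):
print(np.iinfo(np.int16).min, np.iinfo(np.int16).max) # -32768 32767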
The values in the data represent the amplitude of the wave (or the loudness of the audio). The energy of the audio can be described by the sum of the squared amplitudes, which is what we compute below.
#Energy of music
np.sum(audData.astype(float)**2)
5386260809035.0
This will depend on the length of the audio, the sample rate and the volume of the audio. A better metric is power, which is energy per second...
#power - energy per unit of time
1.0/(2*(audData.size)+1)*np.sum(audData.astype(float)**2)/rate
92.31850880740502
Now, let's plot the amplitude of the track over time...
import matplotlib.pyplot as plt
#create a time variable in seconds
time = np.arange(0, float(audData.shape[0]), 1) / rate
#plot amplitude (or loudness) over time
plt.figure(1)
plt.subplot(211)
plt.plot(time, audData, linewidth=0.01, alpha=1, color='#00ff00')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
Text(0, 0.5, 'Amplitude')
Another common way to analyse audio is to create a spectrogram. Audio spectrograms are heat maps that show the frequencies of the sound in Hertz (Hz) and the volume of the sound in Decibels (dB), against time.
plt.figure(2, figsize=(8,6))
plt.subplot(211)
Pxx, freqs, bins, im = plt.specgram(audData, Fs=rate, NFFT=1024, cmap=plt.get_cmap('viridis'))
cbar=plt.colorbar(im)
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
cbar.set_label('Intensity dB')
plt.show()
C:\Users\afelix\anaconda3\lib\site-packages\matplotlib\axes\_axes.py:7558: RuntimeWarning: divide by zero encountered in log10
The result allows us to pick out a certain frequency and examine it:
np.where(freqs==10034.47265625) # find which row of Pxx holds the ~10kHz frequency bin (row 233)
MHZ10=Pxx[233,:] # pull that frequency bin out across time
plt.plot(bins, MHZ10, color='#000000')
[<matplotlib.lines.Line2D at 0x1f105411bb0>]
Okay, that's it for sound!
In the final section of this studio, we're going to use a mixture of matplotlib and another library, imageio, to examine how images work as computational data (and how they're all also secretly grids and numbers).
First, let's import imageio (matplotlib is already imported above), and drag in an image:
import imageio
#replace the link with the link to an image of your choice
pic = imageio.imread("https://www.gannett-cdn.com/-mm-/1f979b4098fb39d02336dd610b5d12a933c3f2ca/c=4-301-1376-1076/local/-/media/2018/02/27/USATODAY/USATODAY/636553434014359690-Thomas-train-at-Edaville.jpg?auto=webp&format=pjpg&width=1200")
plt.figure(figsize = (15,15))
plt.imshow(pic)
<matplotlib.image.AxesImage at 0x1f10697d4c0>
All digital images look like this (thanks Stanford for the image):

Just like your graphs above, they have an x and y axis.
Each pixel is made up of three values: red (r), green (g) and blue (b):

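To see that an image really is just a grid of RGB numbers, here's a tiny sketch that builds a 2x2 image by hand (the values are made up; numpy is already imported as np above):
tiny = np.array([[[255, 0, 0], [0, 255, 0]],
                 [[0, 0, 255], [255, 255, 255]]], dtype="uint8") # red, green / blue, white
plt.imshow(tiny)
plt.show()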
We will investigate this a little more in our image workshop, but for now, this provides us with two ways of classifying (and so, searching through) the enormous data set that is an image: colour, and position.
First, let's check that your image has 3 dimensions (height, width and the RGB colour channels):
print('Dimension of Image {}'.format(pic.ndim))
Dimension of Image 3
Now, let's find the RGB value of a single pixel!
rgb = pic[100, 50]
print(rgb)
[198 187 195]
Can we split the layers so each image just shows the red, green and blue values?
import numpy as np # thanks to Yassine Hamdaoui for the code
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(15,5))
for c, ax in zip(range(3), ax):
    # create a zero matrix the same shape as the image ('dtype' would otherwise default to 'numpy.float64')
    split_img = np.zeros(pic.shape, dtype="uint8")
    # assign one colour channel
    split_img[:, :, c] = pic[:, :, c]
    # display that channel
    ax.imshow(split_img)
What happens if we change the r value of a block of pixels (rows 110 to 300, columns 630 to 820) up to a high intensity of 200?
import matplotlib.pyplot as plt
pic[110:300 , 630:820 , 0] = 200 # high intensity for those pixels' R channel
plt.figure( figsize = (5,5))
plt.imshow(pic)
plt.show()
And finally, let's grab a fresh image and highlight only the pixels whose r channel values are higher than 210, by setting those pixels to a flat 225!
pic = imageio.imread("https://pyxis.nymag.com/v1/imgs/38f/669/8b54861da47434cfd35f0966222f3594b9-08-breadface.rsquare.w700.jpg")
red_mask = pic[:, :, 0] > 210
pic[red_mask] = 225
plt.figure(figsize=(5,5))
plt.imshow(pic)
<matplotlib.image.AxesImage at 0x1f106274340>
That's it for today! Don't forget to post your graph or image in the #studios slack channel.